### Generate Topic Matrix for each topic model
Before running anything else, first run `python generate_topic_mat.py` to generate the topic matrix for each of the four topic models we experiment with.

### Synthetic document experiments
The folders `src_Pure`, `src_LDA`, `src_CTM`, and `src_PAM` contain the pure, LDA, CTM, and PAM synthetic experiments for the t=1 case, respectively. The folders `src_Pure_t=2`, `src_LDA_t=2`, `src_CTM_t=2`, and `src_PAM_t=2` contain the corresponding experiments for the t=2 case. In each folder, run `main.py` with the following arguments to start a synthetic experiment.

`python3 main.py [--tests (number of test documents)] [--times (number of times t target words are sampled per training document)] [--hdim (neural network width)] [--layers (neural network depth)] [--epochs (number of training epochs)] [--doclength (average document length)] [--save_model] [--constrained_opt] [--reuse_train_data] [--gen_new_tests]` 

For example, `python main.py --tests 200 --times 6 --hdim 768 --layers 8 --epochs 100 --doclength 60 --save_model` initializes a model with hidden dimension 768 and 8 layers and trains it for 100 epochs. For each training document, 6 prediction targets are sampled, where each prediction target consists of `t` masked words (`t=1` or `t=2` depending on `--two_targets`; the default is `t=1`). The model's performance, measured in total variation (TV) distance, is evaluated on 200 test documents. If `--save_model` is used, the final neural network weights are saved in the `savedmodels` directory.
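To repeat the example configuration across the four t=1 folders, a small driver script can help. The following is a minimal sketch, not part of the repository: the helper names `build_command` and `run_all` are hypothetical, and it assumes each `main.py` is launched from inside its own folder with only the flags documented above. By default it only prints the commands (`dry_run=True`).

```python
import subprocess
import sys
from pathlib import Path

# Folder names for the t=1 experiments, as listed in this README.
T1_FOLDERS = ["src_Pure", "src_LDA", "src_CTM", "src_PAM"]


def build_command(tests, times, hdim, layers, epochs, doclength, save_model=True):
    """Assemble the argument list for main.py (hypothetical helper)."""
    cmd = [
        sys.executable, "main.py",
        "--tests", str(tests),
        "--times", str(times),
        "--hdim", str(hdim),
        "--layers", str(layers),
        "--epochs", str(epochs),
        "--doclength", str(doclength),
    ]
    if save_model:
        cmd.append("--save_model")
    return cmd


def run_all(repo_root=".", dry_run=True, **config):
    """Print, and optionally run, the same command in each t=1 folder."""
    cmd = build_command(**config)
    for folder in T1_FOLDERS:
        workdir = Path(repo_root) / folder
        print(f"[{workdir}] {' '.join(cmd)}")
        if not dry_run:
            # Assumption: main.py expects to be run from its own folder.
            subprocess.run(cmd, cwd=workdir, check=True)


if __name__ == "__main__":
    run_all(tests=200, times=6, hdim=768, layers=8, epochs=100, doclength=60)
```

Pass `dry_run=False` to `run_all` to actually launch the experiments one after another; `check=True` stops the loop at the first failing run.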

### Evaluate topic posterior recovered by SSL
To evaluate each saved model on the test documents, run `metrics.py`, which measures the TV distance and the major topics recovery rate of the topic posterior recovered by SSL. Before running `metrics.py`, set the variables `num_words` and `N_test` to the length of each test document and the total number of test documents, respectively. For instance, to evaluate the models from the above example, set `num_words=60` and `N_test=200`. The document representations learned by SSL and the recovered topic posteriors are stored in the folder `recovered`. In the folders `src_Pure`, `src_LDA`, `src_CTM`, and `src_PAM`, one can also run `metrics_baseline.py` to compute the same metrics for the posteriors inferred by MCMC and variational inference. Again, set `num_words` and `N_test` to the length of each test document and the total number of test documents before running.
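For reference, the TV distance between two topic posteriors is half the L1 distance between them. The sketch below illustrates this standard definition only. It is not the code in `metrics.py`, and the function names `tv_distance` and `mean_tv_distance` are hypothetical.

```python
def tv_distance(p, q):
    """Total variation distance between two distributions over the same topics."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of topics")
    # TV(p, q) = 0.5 * sum_k |p_k - q_k|
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def mean_tv_distance(true_posteriors, recovered_posteriors):
    """Average TV distance over a collection of test documents."""
    if len(true_posteriors) != len(recovered_posteriors):
        raise ValueError("need one recovered posterior per test document")
    distances = [tv_distance(p, q)
                 for p, q in zip(true_posteriors, recovered_posteriors)]
    return sum(distances) / len(distances)


if __name__ == "__main__":
    # Toy posteriors over three topics for two documents (made-up numbers).
    true_post = [[0.7, 0.2, 0.1], [1.0, 0.0, 0.0]]
    recovered = [[0.5, 0.3, 0.2], [1.0, 0.0, 0.0]]
    print(mean_tv_distance(true_post, recovered))
```

A TV distance of 0 means the two posteriors coincide, and 1 means they put their mass on disjoint sets of topics.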

### Visualization
The code for visualizing our results in section 5 and appendix section C can be found under the `synthetic_experiment_visualization` folder. 

